OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report
Internal identifier: 002588 (Main/Exploration); previous: 002587; next: 002589
Authors: XIANG TONG [United States]; CHENGXIANG ZHAI [United States]; N. Milic-Frayling [United States]; D. A. Evans [United States]
Source:
- NIST special publication [1048-776X]; 1997.
French descriptors
- Pascal (Inist)
English descriptors
- KwdEn :
Abstract
In the CLARIT TREC-5 confusion track experiments, the authors explored two techniques for improving retrieval performance over corrupted data: (1) OCR word error correction to improve OCR text accuracy, and (2) query expansion by adding query term variants found in the corrupted text. The OCR word correction technique is based on statistical word bigram modeling (Tong & Evans 1996). The variants of a query term are terms similar to the query term, as measured by the edit distance (Wagner 1974). While the official runs were based on the first approach, in the follow-up experiments they tested the second approach as well. In this report, they give a brief description of the OCR correction and query expansion techniques, and then discuss the results of the experiments.
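The second technique in the abstract, expanding a query term with similar terms found in the corrupted text, can be illustrated with a minimal sketch. It assumes the standard dynamic-programming (Levenshtein) edit distance with unit costs, which may differ from the exact measure and thresholds used in the report. The names `edit_distance`, `expand_query_term`, `max_distance`, and the sample vocabulary are hypothetical and do not come from the CLARIT system.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expand_query_term(term, vocabulary, max_distance=1):
    """Return vocabulary terms within max_distance edits of term, excluding term."""
    return sorted(w for w in set(vocabulary)
                  if w != term and edit_distance(term, w) <= max_distance)


# Hypothetical vocabulary, standing in for terms indexed from OCR-corrupted text.
vocabulary = ["retrieval", "retrieva1", "retneval", "rctrieval", "query", "qucry"]
variants = expand_query_term("retrieval", vocabulary, max_distance=1)
print(variants)
```

With `max_distance=1` the sketch keeps only single-character corruptions of the query term. Raising the threshold admits more OCR variants, at the cost of also matching unrelated words.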
Affiliations:
Links to previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000885
- to stream PascalFrancis, to step Curation: 000B12
- to stream PascalFrancis, to step Checkpoint: 000893
- to stream Main, to step Merge: 002724
- to stream Main, to step Curation: 002588
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report</title>
<author><name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">98-0270910</idno>
<date when="1997">1997</date>
<idno type="stanalyst">PASCAL 98-0270910 INIST</idno>
<idno type="RBID">Pascal:98-0270910</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000885</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B12</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000893</idno>
<idno type="wicri:doubleKey">1048-776X:1997:Xiang Tong:ocr:correction:and</idno>
<idno type="wicri:Area/Main/Merge">002724</idno>
<idno type="wicri:Area/Main/Curation">002588</idno>
<idno type="wicri:Area/Main/Exploration">002588</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report</title>
<author><name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author><name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">NIST special publication</title>
<title level="j" type="abbreviated">NIST spec. publ.</title>
<idno type="ISSN">1048-776X</idno>
<imprint><date when="1997">1997</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">NIST special publication</title>
<title level="j" type="abbreviated">NIST spec. publ.</title>
<idno type="ISSN">1048-776X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automated processing</term>
<term>Automatic correction</term>
<term>Data</term>
<term>Degradation</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Query</term>
<term>Question processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Recherche information</term>
<term>Reconnaissance optique caractère</term>
<term>Correction automatique</term>
<term>Question documentaire</term>
<term>Traitement automatisé</term>
<term>Dégradation</term>
<term>Donnée</term>
<term>CLARIT</term>
<term>Elargissement question</term>
<term>Traitement question</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In CLARIT TREC-5 confusion track experiments, they explored two techniques for improving retrieval performance over corrupted data : (1) OCR word error correction to improve OCR text accuracy, and (2) query expansion by adding query term variants found in the corrupted text. The OCR word correction technique is based on statistical word bigram modeling (Tong &amp; Evans 1996). The variants of a query term are terms similar to the query term, as measured by the edit distance (Wagner 1974). While the official runs were based on the first approach, in the follow-up experiments they tested the second approach as well. In this report, they give a brief description of the OCR correction and query expansion techniques, and then discuss the results of the experiments</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Pennsylvanie</li>
</region>
<settlement><li>Pittsburgh</li>
</settlement>
<orgName><li>Université Carnegie-Mellon</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Pennsylvanie"><name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
</region>
<name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
</country>
</tree>
</affiliations>
</record>
To work with this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002588 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002588 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:98-0270910 |texte= OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report }}
This area was generated with Dilib version V0.6.32.